The Efficacy of Covid Vaccines

By Ethan Xu, Emily Heiss, and Brandon Quan

Introduction

The purpose of this page is to guide you through the Data Science process and answer some essential questions about the COVID-19 pandemic.

Around the world, there is a great deal of "vaccine hesitancy". The CDC counts as vaccine hesitant anyone who says they are "unsure" whether they will receive the vaccine, or that they "probably won't" or "definitely won't". You can look at the CDC's visualization and article to understand the extent of vaccine hesitancy in the United States, though many of you will already be familiar with the status of the vaccine debate there. What about the rest of the world? Who's right in this debate? How do the vaccines affect the rate of cases and the severity of symptoms? Perhaps the numbers coming from the producers of the vaccines aren't enough to convince you one way or the other. We believe we can help you take a more informed position through the use of data science.

Table of Contents

  1. Data Collection
  2. Cleaning the Data
  3. Exploratory Analysis and Visualizations
  4. Model: Hypothesis Testing and Machine Learning
  5. Conclusion

Data Collection

Before we get into the data science, let's go over some essential technology we'll be working with. This whole project was created in a Jupyter notebook. Jupyter supports many different languages, but we're going to use the one you've probably heard the most about: Python 3.

Python has several libraries that help us immensely. A list has been provided below, but don't feel like you have to read through every link. Understanding every piece of tech isn't essential.

  1. Pandas
  2. NumPy
  3. Seaborn
  4. Sklearn

We use more than just the four libraries above, but they are the most essential.

Now onto the science! The first step of the data science process is data collection. We're going to need some data to analyze. At this point, it's important to consider what question we are asking. We'll start off with something simple: how do the most common COVID-19 vaccines affect the number and severity of cases?

In order to answer this question, we're going to grab some data from the WHO (World Health Organization). Click here to see a nice table the WHO has provided on COVID-19 around the world. We're going to download some of the CSVs the WHO used to create this table.

Please note: You can download the CSVs yourself and follow along in your own notebook, or you can follow this link to our Google Drive and run this notebook yourself! Everything's already there: the CSVs are downloaded and all the code is written. We strongly recommend this option.

Let's begin by importing the necessary libraries.

Next we're going to get the data out of the CSVs and into a pandas DataFrame.
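A minimal sketch of this step, assuming pandas is installed. The file names used in the original notebook are not shown on this page, so the example reads two tiny made-up CSV snippets from in-memory strings; with the real WHO downloads you would pass the file path to `pd.read_csv` instead. The column names and every value below are illustrative assumptions, not WHO data.

```python
import io

import pandas as pd

# Made-up sample rows standing in for the downloaded WHO CSV files.
# Column names here are assumptions chosen for illustration.
cases_csv = io.StringIO(
    "COUNTRY,CASES_CUMULATIVE,DEATHS_CUMULATIVE\n"
    "Country A,1000,20\n"
    "Country B,500,5\n"
)
vaccines_csv = io.StringIO(
    "COUNTRY,TOTAL_VACCINATIONS,FIRST_VACCINE_DATE\n"
    "Country A,1500,2021-01-04\n"
    "Country B,300,\n"  # empty field: pandas reads it as a missing value (NaN)
)

# read_csv accepts a file path or any file-like object.
cases = pd.read_csv(cases_csv)
vaccines = pd.read_csv(vaccines_csv)

print(cases.head())
print(vaccines.head())
```

Calling `head()` right after loading is a quick way to confirm the columns were parsed the way you expect.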

Cleaning the Data

We've just collected a lot of data from the internet, but it's a bit messy. It's missing values. There are variables we don't need. It's hard to read and difficult to work with when we're coding. We've reached the second part of the data science process: cleaning the data.

Let's go ahead and drop the variables we don't need.
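Dropping columns in pandas might look like the sketch below. Which columns the original notebook dropped is not shown here, so `DATA_SOURCE` and `DATE_UPDATED` are hypothetical stand-ins, and the rows are made up.

```python
import pandas as pd

# Toy table; the values and the two "unneeded" column names are assumptions.
vaccines = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B"],
    "DATA_SOURCE": ["REPORTING", "REPORTING"],
    "DATE_UPDATED": ["2021-12-01", "2021-12-02"],
    "TOTAL_VACCINATIONS": [1500, 300],
})

# drop(columns=...) returns a new DataFrame without the listed columns.
unneeded = ["DATA_SOURCE", "DATE_UPDATED"]
vaccines = vaccines.drop(columns=unneeded)

print(vaccines.columns.tolist())
```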

We've dropped what we don't need. Let's merge these tables together so all the data is in one place.
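A merge on a shared country column could be sketched as follows. The key name `COUNTRY`, the other column names, and all values are assumptions for illustration; the real tables may need a column renamed first so that the keys match.

```python
import pandas as pd

# Two toy tables sharing a COUNTRY key (made-up values).
cases = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B", "Country C"],
    "CASES_CUMULATIVE": [1000, 500, 50],
})
vaccines = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B"],
    "TOTAL_VACCINATIONS": [1500, 300],
})

# An inner merge keeps only countries present in both tables.
merged = cases.merge(vaccines, on="COUNTRY", how="inner")

print(merged)
```

With `how="left"` instead, Country C would be kept with a missing `TOTAL_VACCINATIONS` value, which is one way missing data can enter a merged table.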

One of those columns, vaccines_used, illustrates a common mistake: it packs several variables into one column. Let's spread that out and turn it into numbers we can use.
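One way to spread a multi-valued column into 0/1 indicator columns is `Series.str.get_dummies`. This is a sketch, not necessarily how the original notebook did it; it assumes the vaccine names are separated by commas with no surrounding spaces, and the vaccine lists below are made up.

```python
import pandas as pd

# Toy rows: each cell lists every vaccine a country uses (made-up values).
df = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B", "Country C"],
    "vaccines_used": ["Pfizer,Moderna", "Moderna", "AstraZeneca,Pfizer"],
})

# One indicator column per distinct vaccine name: 1 if used, 0 otherwise.
indicators = df["vaccines_used"].str.get_dummies(sep=",")

# Replace the packed column with the indicator columns.
df = pd.concat([df.drop(columns="vaccines_used"), indicators], axis=1)

print(df)
```

If the real separator is a comma followed by a space, strip the whitespace first or the same vaccine can end up under two different column names.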

Much nicer! Now we can use the type of vaccine used to help build our model.

Next we have to deal with missing data. This is a problem all data scientists deal with. We can take the simple route and remove the rows which have missing data, or we can go further into imputation. Imputation is the practice of replacing the missing data with substitute values estimated from the rest of the table. The different techniques can get a little complicated, so we won't go over all of them. First let's get an idea of how much data is missing.
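Counting the missing values per column can be done with `isna().sum()`. The sketch below uses a small made-up table; the column names are taken from this page, the values are not real.

```python
import numpy as np
import pandas as pd

# Toy table with a few gaps (made-up values).
df = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B", "Country C", "Country D"],
    "FIRST_VACCINE_DATE": ["2021-01-04", None, "2021-02-10", None],
    "PERSONS_FULLY_VACCINATED": [900.0, np.nan, 120.0, 40.0],
})

# isna() marks each missing cell as True; sum() counts them column by column.
missing_per_column = df.isna().sum()

print("rows:", len(df))
print(missing_per_column)
```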

There are 222 rows in our dataset. Above you can see how many values are missing in each column. For example, in the first column there are 0 values missing, and in the third column there are 3 values missing. In one of the columns, "FIRST_VACCINE_DATE", there are 16 values missing. All in all, not that much is missing.

Now, a lot of the nations that are missing data on when the first vaccine was administered are very small, often isolated countries that aren't that useful to our dataset. Let's go ahead and drop the nations this applies to.
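Dropping only the rows that lack a first vaccine date can be sketched with `dropna(subset=...)`. The rows below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy table (made-up values); two countries have no first vaccine date.
df = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B", "Country C", "Country D"],
    "FIRST_VACCINE_DATE": ["2021-01-04", None, "2021-02-10", None],
    "PERSONS_FULLY_VACCINATED": [900.0, 15.0, np.nan, 40.0],
})

# subset= limits the check to one column, so rows with gaps elsewhere are kept.
df = df.dropna(subset=["FIRST_VACCINE_DATE"]).reset_index(drop=True)

print(df)
```

Note that Country C survives even though another of its values is missing; that gap is left for the imputation step.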

We still have some missing data. It looks like the missing data is in the columns "PERSONS_VACCINATED_1PLUS_DOSE", "PERSONS_VACCINATED_1PLUS_DOSE_PER100", "PERSONS_FULLY_VACCINATED", and "PERSONS_FULLY_VACCINATED_PER100". We'd like to fill those in. Let's go ahead and do a hot deck imputation. Hot deck imputation is the process of imputing values from the observations that are most similar to the observation with missing data. Basically, for every row with missing data, we're going to find the most similar observation and copy over its values.

How do we measure similarity? Well, it appears we're trying to impute data related to vaccination rates. So to evaluate similarity, we're going to compare values for total vaccinations per 100,000 people.
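The idea can be sketched as a nearest-neighbor hot deck: for each row with a gap, find the complete row whose total-vaccination rate is closest and copy its value. This is an illustrative sketch, not the notebook's actual code. The helper `hot_deck_impute` and the matching column name `TOTAL_VACCINATIONS_RATE` are hypothetical, the values are made up, and the sketch assumes the matching column itself has no missing values.

```python
import numpy as np
import pandas as pd


def hot_deck_impute(df, match_col, target_cols):
    """Fill gaps in target_cols from the complete row closest in match_col.

    Hypothetical helper; assumes match_col has no missing values.
    """
    out = df.copy()
    # Donors are rows with no gaps in any of the target columns.
    donors = out.dropna(subset=target_cols)
    needs_fill = out[target_cols].isna().any(axis=1)
    for idx in out[needs_fill].index:
        # Similarity = absolute difference in the matching column.
        distance = (donors[match_col] - out.at[idx, match_col]).abs()
        donor_idx = distance.idxmin()
        for col in target_cols:
            if pd.isna(out.at[idx, col]):
                out.at[idx, col] = donors.at[donor_idx, col]
    return out


# Toy data (made-up values): C and D are missing their fully vaccinated rate.
df = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B", "Country C", "Country D"],
    "TOTAL_VACCINATIONS_RATE": [150.0, 40.0, 145.0, 45.0],
    "PERSONS_FULLY_VACCINATED_PER100": [70.0, 15.0, np.nan, np.nan],
})

imputed = hot_deck_impute(
    df, "TOTAL_VACCINATIONS_RATE", ["PERSONS_FULLY_VACCINATED_PER100"]
)
print(imputed)
```

Here Country C borrows from Country A (145 is closest to 150) and Country D borrows from Country B (45 is closest to 40). Only rows that were already complete act as donors, so an imputed value is never copied onward to another row.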

Data has been imputed!

Look! No more missing values. Let's start doing some real analysis...

Exploratory Analysis and Visualizations

It's time for some "exploratory analysis" and visualizations. We're going to look for patterns in the data, view relationships between variables, analyze the skew, and maybe transform the dataset.

Expect to see a lot of graphs!

We're going to be using Seaborn and matplotlib for much of the graphing.

First, let's start with a simple bar graph. We'll try and get a sense of how the pandemic has affected the world. Let's make a graph documenting the 30 countries with the highest cumulative deaths.
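A sketch of this kind of chart, using `nlargest` to pick the top countries and plain matplotlib to draw the bars (seaborn's `barplot` could be used in much the same way). The column name `DEATHS_CUMULATIVE` and the values are made up, and `TOP_N` is set to 3 instead of 30 to keep the toy example small.

```python
import matplotlib

matplotlib.use("Agg")  # draw off-screen; remove this line inside a notebook

import matplotlib.pyplot as plt
import pandas as pd

# Toy data (made-up values).
df = pd.DataFrame({
    "COUNTRY": ["Country A", "Country B", "Country C", "Country D", "Country E"],
    "DEATHS_CUMULATIVE": [200, 50, 900, 10, 400],
})

TOP_N = 3  # this page uses 30 on the full dataset

# nlargest returns the TOP_N rows with the highest values, already sorted.
top = df.nlargest(TOP_N, "DEATHS_CUMULATIVE")

fig, ax = plt.subplots(figsize=(8, 4))
ax.bar(top["COUNTRY"], top["DEATHS_CUMULATIVE"])
ax.set_xlabel("Country")
ax.set_ylabel("Cumulative deaths")
ax.set_title(f"Top {TOP_N} countries by cumulative deaths")
ax.tick_params(axis="x", rotation=90)  # keep long country names readable
fig.tight_layout()
```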

These countries you see above lost the most. Of course this will mostly include the nations with larger populations, but COVID-19 is a global problem. We believe the scale of the tragedy should be considered globally. You can see above that millions and millions of people have died because of the pandemic.

Let's see how the world has fought this tragedy. Let's look at the countries which have vaccinated the most.

Above we can see that billions of vaccinations have been distributed around the globe. Obviously, the world has made a great effort. One interesting thing to note is that the countries with the highest cumulative deaths also make the list of countries with the highest total vaccinations.

This data is interesting to observe, but it's time for real analysis. We're going to have to explore relationships between variables in order to create a model.

Let's look at how vaccination rates correlate with recent cases.

Hmm, it looks like the correlation between cases and fully vaccinated individuals is positive. This means that countries with more fully vaccinated individuals also tend to report more cases.

This could have to do with population sizes: a larger country will tend to have both more vaccinated people and more cases.
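To see how population size alone can produce a positive correlation, here is a small sketch comparing raw counts with per-100 rates. The numbers are entirely made up and deliberately constructed to show the effect; they say nothing about the real WHO data. The column names are assumptions as well.

```python
import pandas as pd

# Contrived toy data: in this invented example, higher vaccination *rates*
# go with lower case *rates*, but the countries differ hugely in size.
df = pd.DataFrame({
    "POPULATION": [1_000_000, 10_000_000, 100_000_000, 50_000_000],
    "PERSONS_FULLY_VACCINATED": [800_000, 6_000_000, 40_000_000, 10_000_000],
    "CASES": [10_000, 200_000, 3_000_000, 2_000_000],
})

# Pearson correlation on raw counts: dominated by country size.
raw_r = df["PERSONS_FULLY_VACCINATED"].corr(df["CASES"])

# Normalize both variables by population before correlating.
df["VACCINATED_PER100"] = 100 * df["PERSONS_FULLY_VACCINATED"] / df["POPULATION"]
df["CASES_PER100"] = 100 * df["CASES"] / df["POPULATION"]
rate_r = df["VACCINATED_PER100"].corr(df["CASES_PER100"])

print("correlation of raw counts:", round(raw_r, 2))
print("correlation of per-100 rates:", round(rate_r, 2))
```

In this toy table the raw counts correlate positively while the per-100 rates correlate negatively, which is why comparing rates rather than totals is usually the safer choice.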

Another thing we are aware of is that the COVID-19 vaccine is not 100% effective at preventing infection. It is still possible to get a breakthrough case.

We do know, however, that the vaccine greatly reduces the risk of hospitalization and death.

Let's take a look at how the vaccine rate and death rates correlate.

It is still a rather positive correlation, although the slope is definitely less steep.

This could be happening for a variety of reasons, such as population density.

Let's see how total vaccinations compare to fully vaccinated.

It still appears to be the same. Let's graph just the 30 countries with the highest death totals to see if vaccines at least helped alleviate that death rate.

Success! It appears there is some evidence that the vaccines lower the death rate.

Model: Hypothesis Testing and Machine Learning

Let's make some models!

At this point we want to use linear regression to obtain a predictive model of our data. Once our model is made, we can predict values that don't exist in our data. For instance, we can see how an even higher vaccine rate would look for a country. On the other hand, we could see what it would look like if a country was not vaccinated at all.
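Fitting and using such a model with scikit-learn might look like the sketch below. It is not the notebook's actual model: the column names are assumptions, and the four data points are made up and perfectly linear so that the arithmetic is easy to follow. The slope they produce is arbitrary and is not a finding.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (made-up, exactly linear: deaths = 100 - 0.5 * vaccinated).
df = pd.DataFrame({
    "PERSONS_FULLY_VACCINATED_PER100": [20.0, 40.0, 60.0, 80.0],
    "DEATHS_PER100K": [90.0, 80.0, 70.0, 60.0],
})

X = df[["PERSONS_FULLY_VACCINATED_PER100"]]  # features must be 2-D
y = df["DEATHS_PER100K"]

model = LinearRegression().fit(X, y)
print("slope:", model.coef_[0], "intercept:", model.intercept_)

# Predict for vaccination rates that do not appear in the data:
# a fully unvaccinated country (0) and a fully vaccinated one (100).
new_rates = pd.DataFrame({"PERSONS_FULLY_VACCINATED_PER100": [0.0, 100.0]})
predictions = model.predict(new_rates)
print(predictions)
```

Predictions at 0 and 100 lie outside the range of the training data, so on real data they are extrapolations and should be read with caution.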

Let's see how the model does with deaths rather than cases.

Now let's revisit the smaller 30-country dataset to see how the model behaves.

Conclusion

After going through the data science process and further exploring the correlation between vaccinations and COVID-19 cases and deaths, we were able to see that data analysis can lead to surprising results.

Going into our exploration, we had expected a clear negative correlation between vaccine rates and deaths and cases due to COVID-19, but that was not the case. It was only when we isolated certain countries that we were able to find a negative correlation.

The COVID-19 virus is still novel. With new information and new variants emerging each day, it is still tough to make predictions and models of its behavior. When we first began this process, the Omicron variant was not yet publicly known, and now as we finish it, we are learning that this variant is infecting even those who are vaccinated.

If we were to do this again, we would try to include many more variables, since this is an ever-changing virus. We would add a city density variable to account for the large vaccination and death totals that we believe caused the positive correlation. We would also add some measurement of mask wearing, weather, etc.